Cancer detection and classification using methylome analysis

ABSTRACT

There is described herein a method of detecting the presence of DNA from cancer cells in a subject comprising: providing a sample of cell-free DNA from a subject; subjecting the sample to library preparation to permit subsequent sequencing of the cell-free methylated DNA; adding a first amount of filler DNA to the sample, wherein at least a portion of the filler DNA is methylated, then optionally denaturing the sample; capturing cell-free methylated DNA using a binder selective for methylated polynucleotides; sequencing the captured cell-free methylated DNA; comparing the sequences of the captured cell-free methylated DNA to control cell-free methylated DNA sequences from healthy and cancerous individuals and from individuals with distinct cancer types and subtypes; and identifying the presence of DNA from cancer cells if there is a statistically significant similarity between one or more sequences of the captured cell-free methylated DNA and cell-free methylated DNA sequences from cancerous individuals.

RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/630,299 filed Jan. 10, 2020, which is a 371 Application of International Application No. PCT/CA2018/000141, filed Jul. 11, 2018, which claims priority to U.S. Provisional Patent Application No. 62/531,527 filed Jul. 12, 2017, each of which is hereby incorporated by reference in its entirety.

FIELD OF THE INVENTION

The invention relates to cancer detection and classification and more particularly to the use of methylome analysis for the same.

BACKGROUND OF THE INVENTION

The use of circulating cell-free DNA (cfDNA) as a source of biomarkers is rapidly gaining momentum in oncology[1]. Use of DNA methylation mapping of cfDNA as a biomarker could have a significant impact in the field of liquid biopsy, as it could allow for the identification of the tissue-of-origin[2], allow for cancer type and subtype classification, and stratify cancer patients in a minimally invasive fashion[3]. Furthermore, using genome-wide DNA methylation mapping of cfDNA could overcome a critical sensitivity problem in detecting circulating tumor DNA (ctDNA) in patients with early-stage cancer with no radiographic evidence of disease. Existing ctDNA detection methods are based on sequencing mutations and have limited sensitivity in part due to the limited number of recurrent mutations available to distinguish between tumor and normal circulating cfDNA[4, 5]. On the other hand, genome-wide DNA methylation mapping leverages large numbers of epigenetic alterations that may be used to distinguish circulating tumor DNA (ctDNA) from normal circulating cell-free DNA (cfDNA). For example, some tumor types, such as ependymomas, can have extensive DNA methylation aberrations without any significant recurrent somatic mutations[6].

Certain methods of capturing cell-free methylated DNA are described in WO 2017/190215, which is incorporated by reference.

SUMMARY OF THE INVENTION

In an aspect, there is provided a method of detecting the presence of DNA from cancer cells in a subject comprising: providing a sample of cell-free DNA from a subject; subjecting the sample to library preparation to permit subsequent sequencing of the cell-free methylated DNA; adding a first amount of filler DNA to the sample, wherein at least a portion of the filler DNA is methylated, then optionally denaturing the sample; capturing cell-free methylated DNA using a binder selective for methylated polynucleotides; sequencing the captured cell-free methylated DNA; comparing the sequences of the captured cell-free methylated DNA to control cell-free methylated DNA sequences from healthy and cancerous individuals; and identifying the presence of DNA from cancer cells if there is a statistically significant similarity between one or more sequences of the captured cell-free methylated DNA and cell-free methylated DNA sequences from cancerous individuals.

In an aspect, there is provided a method of detecting the presence of DNA from cancer cells and identifying a cancer subtype, the method comprising: receiving sequencing data of cell-free methylated DNA from a subject sample; comparing the sequences of the captured cell-free methylated DNA to control cell-free methylated DNA sequences from healthy and cancerous individuals; identifying the presence of DNA from cancer cells if there is a statistically significant similarity between one or more sequences of the captured cell-free methylated DNA and cell-free methylated DNA sequences from cancerous individuals; and if DNA from cancer cells is identified, further identifying the cancer cell tissue of origin and cancer subtype based on the comparison.

In an aspect, there is provided a computer-implemented method of detecting the presence of DNA from cancer cells and identifying a cancer subtype, the method comprising: receiving, at at least one processor, sequencing data of cell-free methylated DNA from a subject sample; comparing, at the at least one processor, the sequences of the captured cell-free methylated DNA to control cell-free methylated DNA sequences from healthy and cancerous individuals; identifying, at the at least one processor, the presence of DNA from cancer cells if there is a statistically significant similarity between one or more sequences of the captured cell-free methylated DNA and cell-free methylated DNA sequences from cancerous individuals; and if DNA from cancer cells is identified, further identifying the cancer cell tissue of origin and cancer subtype based on the comparison.

In an aspect, there is provided a computer program product for use in conjunction with a general-purpose computer having a processor and a memory connected to the processor, the computer program product comprising a computer readable storage medium having a computer mechanism encoded thereon, wherein the computer program mechanism may be loaded into the memory of the computer and cause the computer to carry out the method described herein.

In an aspect, there is provided a computer readable medium having stored thereon a data structure for storing the computer program product described herein.

In an aspect, there is provided a device for detecting the presence of DNA from cancer cells and identifying a cancer subtype, the device comprising: at least one processor; and electronic memory in communication with the at least one processor, the electronic memory storing processor-executable code that, when executed at the at least one processor, causes the at least one processor to: receive sequencing data of cell-free methylated DNA from a subject sample; compare the sequences of the captured cell-free methylated DNA to control cell-free methylated DNA sequences from healthy and cancerous individuals; identify the presence of DNA from cancer cells if there is a statistically significant similarity between one or more sequences of the captured cell-free methylated DNA and cell-free methylated DNA sequences from cancerous individuals; and if DNA from cancer cells is identified, further identify the cancer cell tissue of origin and cancer subtype based on the comparison.

In an aspect, there is provided a method of detecting the presence of DNA from cancer cells and determining the location of the cancer from which the cancer cells arose from two or more possible organs, the method comprising: providing a sample of cell-free DNA from a subject; capturing cell-free methylated DNA from said sample, using a binder selective for methylated polynucleotides; sequencing the captured cell-free methylated DNA; comparing the sequence patterns of the captured cell-free methylated DNA to DNA sequence patterns of two or more populations of control individuals, each of said two or more populations having localized cancer in a different organ; and determining from which organ the cancer cells arose on the basis of a statistically significant similarity between the pattern of methylation of the cell-free DNA and one of said two or more populations.

BRIEF DESCRIPTION OF FIGURES

These and other features of the preferred embodiments of the invention will become more apparent in the following detailed description in which reference is made to the appended drawings wherein:

FIG. 1 shows methylome analysis of cfDNA is a highly sensitive approach to enrich and detect ctDNA in low amounts of input DNA. FIG. 1A shows a computer simulation of the probability to detect at least one epimutation as a function of the concentration of ctDNA (columns), number of DMRs being investigated (rows), and the sequencing depth (x-axis). FIG. 1B shows genome-wide Pearson correlation between DNA methylation signal for 1 to 100 ng of input DNA from HCT116 cell line fragmented to mimic plasma cfDNA. Each concentration has two biological replicates. FIG. 1C shows a DNA methylation profile obtained from cfMeDIP-seq from different concentrations of input DNA from HCT116 (Green Tracks) plus RRBS (Reduced Representation Bisulfite Sequencing) HCT116 data obtained from ENCODE (ENCSR000DFS) and WGBS (Whole-Genome Bisulfite Sequencing) HCT116 data obtained from GEO (GSM1465024). For the heatmap (RRBS track), yellow means methylated, blue means unmethylated and gray means no coverage. FIG. 1D and FIG. 1E show results of serial dilution of the CRC cell line HCT116 into the Multiple Myeloma (MM) cell line MM1.S. cfMeDIP-seq was performed in pure HCT116 DNA (100% CRC), pure MM1.S DNA (100% MM) and 10%, 1%, 0.1%, 0.01%, and 0.001% CRC DNA diluted into MM DNA. All DNA was fragmented to mimic plasma cfDNA. We observed an almost perfect linear correlation (r²=0.99, p<0.0001) between the observed versus expected (FIG. 1D) numbers of DMRs and (FIG. 1E) the DNA methylation signal (in RPKM) within those DMRs. FIG. 1F illustrates that in the same dilution series, known somatic mutations are only detectable at 1/100 allele fraction by ultra-deep (>10,000×) targeted sequencing, above the background sequencer and polymerase error rate. Shown are the fractions of reads containing each base or an insertion/deletion at the site of each mutation in the CRC cell line. FIG. 1G depicts a bar graph showing frequency of ctDNA (human) as a percentage of total cfDNA (human+mice) in the plasma of mice harboring patient-derived xenograft (PDX) from two colorectal cancer patients.

FIG. 2 shows the methylome analysis of plasma cfDNA allows tumor classification. FIG. 2A illustrates a schematic demonstrating the approach of machine learning classifier construction for cancer classification. FIG. 2B depicts a heatmap of DMRs contained within the multi-class elastic net machine learning classifiers. The classifiers were trained on plasma DNA samples from healthy donors (n=24), lung cancer (n=25), breast cancer (n=25), colorectal cancer (n=23), acute myelogenous leukemia (AML) (n=28), and glioblastoma multiforme (GBM) (n=71). Hierarchical clustering method: Ward. FIG. 2C shows 2D visualizations by tSNE (t-Distributed Stochastic Neighbor Embedding) of the cancer-type associated DMRs identified in 10% or 25% of models. FIG. 2D depicts a plot showing metrics for the plasma cfDNA methylation-based multi-cancer classifier. Area under the receiver operating characteristic curve (auROC) shown on the y-axis for each cancer type and healthy donors following 50-fold generation of elastic net machine learning classifiers.

FIG. 3 shows validation of the multi-cancer classifier on independent cohorts. In FIG. 3A, ROC curves are shown for independent validation of the multi-cancer classifier on cohorts of lung cancer (LUC) (n=55 LUC vs n=97 other), AML (n=35 AML vs n=117 other), and healthy donors (n=62 healthy donors vs n=90 other). In FIG. 3B, ROC curves are shown for independent validation of the multi-cancer classifier on early stage LUC (n=32 stage I-II LUC vs n=97 other) and late stage LUC (n=23 stage III-IV LUC vs n=97 other).

FIG. 4 shows the methylome analysis of plasma cfDNA allows tumor subtype classification. FIG. 4A shows 2D visualizations by tSNE (t-Distributed Stochastic Neighbor Embedding) of cancer subtype associated DMRs. Breast cancer subtypes show ability to distinguish between patients harboring tumors with distinct gene expression pattern and transcription factor activity (ER status) as well as distinct tumor copy number aberrations (HER2 status). AML subtypes show ability to distinguish between patients harboring tumors with distinct rearrangements (FLT3 status). Glioblastoma multiforme (GBM) subtypes show ability to distinguish between patients harboring tumors with distinct point mutations (IDH gene mutational status). Lung cancer subtypes show ability to distinguish between patients harboring tumors with distinct histologies that have prognostic and therapeutic implications (adenocarcinoma vs. squamous carcinoma vs. small cell carcinoma). FIG. 4B depicts a heatmap showing the top DMRs that allow accurate discrimination of the three breast cancer subtypes in breast cancer plasma samples. FIG. 4C depicts a heatmap showing the top DMRs that allow accurate discrimination of the FLT3-ITD status in AML patient plasma samples. FIG. 4D depicts a heatmap showing the top DMRs that allow accurate discrimination of the IDH gene mutational status in glioblastoma multiforme (GBM) patient plasma samples. FIG. 4E depicts a heatmap showing the top DMRs that allow accurate discrimination of the three lung cancer histologies in lung cancer plasma samples.

FIG. 5 shows a suitably configured computer device, and associated communications networks, devices, software and firmware to provide a platform for enabling one or more embodiments as described herein.

FIG. 6 shows sequencing saturation analysis and quality controls. FIG. 6A, FIG. 6B, FIG. 6C, FIG. 6D, and FIG. 6E show the results of the saturation analysis from the Bioconductor package MEDIPS analyzing cfMeDIP-seq data from each replicate for each input concentration from the HCT116 DNA fragmented to mimic plasma cfDNA. FIG. 6F is a graph showing the results of the protocol tested in two replicates of four starting DNA concentrations (100, 10, 5, and 1 ng) of HCT116 cell line. Specificity of the reaction was calculated using methylated and unmethylated spiked-in A. thaliana DNA. Fold enrichment ratio was calculated using genomic regions of the fragmented HCT116 DNA (Primers for methylated testis-specific H2B, TSH2B0 and unmethylated human DNA region (GAPDH promoter)). The horizontal dotted line indicates a fold-enrichment ratio threshold of 25. Error bars represent ±1 s.e.m. FIG. 6G depicts a bar graph of the CpG Enrichment Scores of the sequenced samples, showing a robust enrichment of CpGs within the genomic regions from the immunoprecipitated samples compared to the input control. The CpG Enrichment Score was obtained by dividing the relative frequency of CpGs of the regions by the relative frequency of CpGs of the human genome. Error bars represent ±1 s.e.m.
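
By way of illustration only, the CpG Enrichment Score described for FIG. 6G can be sketched as a ratio of two relative frequencies. The sketch below assumes that "relative frequency of CpGs" means the number of CG dinucleotides divided by the number of bases; the exact definition used by the MEDIPS package may differ, and the function names and toy sequences are hypothetical.

```python
def cpg_relative_frequency(sequences):
    """CpG dinucleotides per base over a collection of sequences.

    Assumption: relative frequency = (number of "CG" dinucleotides) / (number of bases).
    """
    total_bases = 0
    total_cpgs = 0
    for seq in sequences:
        seq = seq.upper()
        total_bases += len(seq)
        # Count every position where a C is immediately followed by a G.
        total_cpgs += sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")
    if total_bases == 0:
        raise ValueError("no bases supplied")
    return total_cpgs / total_bases


def cpg_enrichment_score(region_sequences, genome_sequences):
    """Relative CpG frequency of the captured regions divided by that of the genome."""
    genome_freq = cpg_relative_frequency(genome_sequences)
    if genome_freq == 0:
        raise ValueError("reference contains no CpGs")
    return cpg_relative_frequency(region_sequences) / genome_freq


if __name__ == "__main__":
    regions = ["CGCGCG"]          # toy captured region, CpG rich
    genome = ["ACGTACGTAAAA"]     # toy reference, CpG poor
    print(cpg_enrichment_score(regions, genome))
```

A score above 1 under these assumptions indicates that the captured regions are richer in CpGs than the reference.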

DETAILED DESCRIPTION

In the following description, numerous specific details are set forth to provide a thorough understanding of the invention. However, it is understood that the invention may be practiced without these specific details.

DNA methylation profiles are cell-type specific and are disrupted in cancer. Using a robust and sensitive method designed for methylome analysis of minute amounts of circulating cell-free DNA (cfDNA), we identified thousands of Differentially Methylated Regions (DMRs) that distinguish multiple tumor types from each other and from healthy individuals. Methylome analysis of cfDNA is highly sensitive and suitable for detecting circulating tumor DNA (ctDNA) in early stage patients. A machine-learning derived classifier using cfDNA methylomes was able to correctly classify 196 plasma samples from patients with 5 cancer types and healthy donors based on cross-validation. In an independent validation, using the same DMRs identified in the plasma cfDNA, the classifier was able to correctly classify AML, lung cancer, and healthy donors, as well as both early and late stage lung cancer. Therefore, methylome analysis of cfDNA can be used for non-invasive early stage detection of ctDNA and robustly classify cancer types.
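
The sensitivity argument above, and the simulation summarized for FIG. 1A, can be illustrated with a simple closed-form sketch. It assumes, purely for illustration, that each read covering a DMR is drawn independently and is tumor-derived with probability equal to the ctDNA fraction; this is not the simulation used to generate FIG. 1A, and the function name and parameter values are hypothetical.

```python
def detection_probability(ctdna_fraction, n_dmrs, depth):
    """Probability of observing at least one tumor-derived read.

    Assumptions (illustrative only): every one of the n_dmrs regions is covered
    by `depth` independent reads, and each read carries the tumor-specific
    methylation state with probability `ctdna_fraction`.
    """
    if not 0.0 <= ctdna_fraction <= 1.0:
        raise ValueError("ctdna_fraction must be between 0 and 1")
    total_reads = n_dmrs * depth
    # Complement of the event "no read at all is tumor-derived".
    return 1.0 - (1.0 - ctdna_fraction) ** total_reads


if __name__ == "__main__":
    for n_dmrs in (1, 10, 100, 1000):
        p = detection_probability(ctdna_fraction=0.001, n_dmrs=n_dmrs, depth=100)
        print(n_dmrs, round(p, 4))
```

Under these assumptions the probability of detection rises quickly with the number of DMRs interrogated, which is the intuition behind using many epigenetic alterations rather than a few recurrent mutations.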

In an aspect, there is provided a method of detecting the presence of DNA from cancer cells in a subject comprising: providing a sample of cell-free DNA from a subject; subjecting the sample to library preparation to permit subsequent sequencing of the cell-free methylated DNA; adding a first amount of filler DNA to the sample, wherein at least a portion of the filler DNA is methylated, then optionally denaturing the sample; capturing cell-free methylated DNA using a binder selective for methylated polynucleotides; sequencing the captured cell-free methylated DNA; comparing the sequences of the captured cell-free methylated DNA to control cell-free methylated DNA sequences from healthy and cancerous individuals; and identifying the presence of DNA from cancer cells if there is a statistically significant similarity between one or more sequences of the captured cell-free methylated DNA and cell-free methylated DNA sequences from cancerous individuals.

Applicant's co-owned applications U.S. Provisional Patent Application No. 62/331,070 filed on May 3, 2016 and International Patent Application No. PCT/CA2017/000108 filed on May 3, 2017 describe methods for capturing cell-free methylated DNA and are incorporated herein by reference.

Cancer has been traditionally classified by tissue of origin, for instance, colorectal cancer, breast cancer, lung cancer, etc. In the modern practice of clinical oncology, it is becoming increasingly important to be able to distinguish subtypes of cancer by various molecular, developmental, and functional underpinnings. Therapeutic decisions often hinge on the precise subtype of cancer, and it may be necessary for clinicians to identify the subtype prior to initiation of therapy. Examples of cancer subtyping that may influence therapeutic decisions include (but are not limited to) stage (e.g., early stage lung cancer treated with surgery vs late stage lung cancer treated with chemotherapy), histology (e.g., small cell carcinoma vs adenocarcinoma vs squamous cell carcinoma in lung cancer), gene expression pattern or transcription factor activity (e.g., ER status in breast cancer), copy number aberrations (e.g., HER2 status in breast cancer), specific rearrangements (e.g., FLT3 in AML), specific gene point mutational status (e.g., IDH gene point mutations), and DNA methylation patterns (e.g., MGMT gene promoter methylation in brain cancer).

The methods described herein are applicable to a wide variety of cancers, including but not limited to adrenal cancer, anal cancer, bile duct cancer, bladder cancer, bone cancer, brain/CNS tumors, breast cancer, Castleman disease, cervical cancer, colon/rectum cancer, endometrial cancer, esophagus cancer, Ewing family of tumors, eye cancer, gallbladder cancer, gastrointestinal carcinoid tumors, gastrointestinal stromal tumor (GIST), gestational trophoblastic disease, Hodgkin disease, Kaposi sarcoma, kidney cancer, laryngeal and hypopharyngeal cancer, leukemia (acute lymphocytic, acute myeloid, chronic lymphocytic, chronic myeloid, chronic myelomonocytic), liver cancer, lung cancer (non-small cell, small cell, lung carcinoid tumor), lymphoma, lymphoma of the skin, malignant mesothelioma, multiple myeloma, myelodysplastic syndrome, nasal cavity and paranasal sinus cancer, nasopharyngeal cancer, neuroblastoma, non-Hodgkin lymphoma, oral cavity and oropharyngeal cancer, osteosarcoma, ovarian cancer, penile cancer, pituitary tumors, prostate cancer, retinoblastoma, rhabdomyosarcoma, salivary gland cancer, sarcoma (adult soft tissue cancer), skin cancer (basal and squamous cell, melanoma, Merkel cell), small intestine cancer, stomach cancer, testicular cancer, thymus cancer, thyroid cancer, uterine sarcoma, vaginal cancer, vulvar cancer, Waldenstrom macroglobulinemia, and Wilms tumor.

Various sequencing techniques are known to the person skilled in the art, such as polymerase chain reaction (PCR) followed by Sanger sequencing. Also available are next-generation sequencing (NGS) techniques, also known as high-throughput sequencing, which include various sequencing technologies including: Illumina (Solexa) sequencing, Roche 454 sequencing, Ion Torrent: Proton/PGM sequencing, and SOLiD sequencing. NGS allows for the sequencing of DNA and RNA much more quickly and cheaply than the previously used Sanger sequencing. In some embodiments, said sequencing is optimized for short read sequencing.

The term “subject” as used herein refers to any member of the animal kingdom, preferably a human being and most preferably a human being that has, has had, or is suspected of having prostate cancer.

Cell-free methylated DNA is DNA that is circulating freely in the blood stream and is methylated at various known regions of the DNA. Samples, for example plasma samples, can be taken to analyze cell-free methylated DNA. Accordingly, in some embodiments, the sample is the subject's blood or plasma.

As used herein, “library preparation” includes end-repair, A-tailing, adapter ligation, or any other preparation performed on the cell-free DNA to permit subsequent sequencing of DNA.

As used herein, “filler DNA” can be noncoding DNA or it can consist of amplicons.

DNA samples may be denatured, for example, using sufficient heat.

In some embodiments, the comparison step is based on fit using a statistical classifier. Statistical classifiers using DNA methylation data can be used for assigning a sample to a particular disease state, such as cancer type or subtype. For the purpose of cancer type or subtype classification, a classifier would consist of one or more DNA methylation variables (i.e., features) within a statistical model, and the output of the statistical model would have one or more threshold values to distinguish between distinct disease states. The particular feature(s) and threshold value(s) that are used in the statistical classifier can be derived from prior knowledge of the cancer types or subtypes, from prior knowledge of the features that are likely to be most informative, from machine learning, or from a combination of two or more of these approaches.
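
As a minimal sketch of the structure just described (features within a statistical model whose output is compared against a threshold), the following assumes a logistic model over per-region methylation signals. The region names, weights, intercept, and threshold are hypothetical placeholders and are not values disclosed herein.

```python
import math


def classifier_score(features, weights, intercept=0.0):
    """Logistic model output in (0, 1) from DNA methylation features.

    `features` maps a region name to its methylation signal; regions missing
    from the sample are treated as having zero signal (an assumption).
    """
    z = intercept + sum(w * features.get(name, 0.0) for name, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))


def assign_state(score, threshold=0.5):
    """Apply a single threshold value to separate two disease states."""
    return "cancer" if score >= threshold else "healthy"


if __name__ == "__main__":
    # Hypothetical weights: one region gains and one loses methylation in cancer.
    weights = {"dmr_a": 1.0, "dmr_b": -1.0}
    sample = {"dmr_a": 2.0, "dmr_b": 0.0}
    score = classifier_score(sample, weights)
    print(round(score, 3), assign_state(score))
```

In practice the weights and threshold would come from prior knowledge, machine learning, or both, as stated above; more than one threshold would be needed to separate more than two disease states.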

In some embodiments, the classifier is machine learning-derived. Preferably, the classifier is an elastic net classifier, lasso, support vector machine, random forest, or neural network.

The genomic space that is analyzed can be genome-wide, or preferably restricted to regulatory regions (i.e., FANTOM5 enhancers, CpG Islands, CpG shores and CpG Shelves).

Preferably, the percentage of spike-in methylated DNA recovered is included as a covariate to control for pulldown efficiency variation.
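
One possible way to carry the spike-in recovery into a model is sketched below. It assumes the recovered and added amounts of methylated spike-in DNA are available in the same units (for example read counts or qPCR-derived quantities); the function names and the covariate key are hypothetical.

```python
def spike_in_recovery_percent(recovered, added):
    """Percentage of methylated spike-in DNA recovered after capture."""
    if added <= 0:
        raise ValueError("added amount must be positive")
    return 100.0 * recovered / added


def with_recovery_covariate(features, recovered, added):
    """Return a copy of the sample's feature mapping with the recovery
    percentage appended, so a downstream model can adjust for pulldown
    efficiency. The key name is an illustrative choice."""
    out = dict(features)
    out["spike_in_recovery_pct"] = spike_in_recovery_percent(recovered, added)
    return out


if __name__ == "__main__":
    sample = {"dmr_a": 2.0, "dmr_b": 0.5}
    print(with_recovery_covariate(sample, recovered=40.0, added=50.0))
```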

For a classifier capable of distinguishing multiple cancer types (or subtypes) from one another, the classifier would preferably consist of differentially methylated regions from pairwise comparisons of each type (or subtype) of interest.
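
The pairwise construction can be sketched as follows. For brevity, the sketch selects regions whose mean signal differs between two classes by at least a cutoff; this mean-difference rule is a stand-in for a proper differential methylation test, and the class names, region names, and cutoff are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pairwise_dmrs(signal_by_class, min_diff=1.0):
    """Select candidate DMRs for every pair of classes.

    signal_by_class: {class_name: {region: [signal per sample, ...]}}
    Returns {(class_a, class_b): [regions whose mean signal differs by >= min_diff]}.
    """
    selected = {}
    for a, b in combinations(sorted(signal_by_class), 2):
        shared = signal_by_class[a].keys() & signal_by_class[b].keys()
        selected[(a, b)] = sorted(
            region
            for region in shared
            if abs(mean(signal_by_class[a][region]) - mean(signal_by_class[b][region])) >= min_diff
        )
    return selected


def classifier_feature_set(selected):
    """Union of the DMRs found across all pairwise comparisons."""
    return sorted(set().union(*selected.values()))


if __name__ == "__main__":
    data = {
        "healthy": {"r1": [0, 0], "r2": [5, 5], "r3": [1, 1]},
        "lung": {"r1": [3, 3], "r2": [5, 5], "r3": [1, 1]},
        "breast": {"r1": [0, 0], "r2": [1, 1], "r3": [1, 1]},
    }
    pairs = pairwise_dmrs(data)
    print(pairs)
    print(classifier_feature_set(pairs))
```

Taking the union over pairs ensures that every type of interest is separated from every other type by at least the regions found in their own comparison.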

In some embodiments, the control cell-free methylated DNA sequences from healthy and cancerous individuals are comprised in a database of Differentially Methylated Regions (DMRs) between healthy and cancerous individuals.

In some embodiments, the control cell-free methylated DNA sequences from healthy and cancerous individuals are limited to those control cell-free methylated DNA sequences which are differentially methylated as between healthy and cancerous individuals in DNA derived from cell-free DNA from bodily fluids, such as from blood serum, cerebral spinal fluid, urine, stool, sputum, pleural fluid, ascites, tears, sweat, pap smear fluid, endoscopy brushings fluid, etc., preferably from blood plasma.

In some embodiments, the sample has less than 100 ng, 75 ng, or 50 ng of cell-free DNA.

In some embodiments, the first amount of filler DNA comprises about 5%, 10%, 15%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100% methylated filler DNA with the remainder being unmethylated filler DNA, and preferably between 5% and 50%, between 10% and 40%, or between 15% and 30% methylated filler DNA.

In some embodiments, the first amount of filler DNA is from 20 ng to 100 ng, preferably 30 ng to 100 ng, more preferably 50 ng to 100 ng.

In some embodiments, the cell-free DNA from the sample and the first amount of filler DNA together comprise at least 50 ng of total DNA, preferably at least 100 ng of total DNA.
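
The arithmetic implied by the preceding embodiments can be sketched as below, assuming a 100 ng total DNA target and a 20% methylated fraction of the filler. Both defaults are illustrative choices within the stated ranges rather than prescribed values, and the function name is hypothetical.

```python
def filler_plan(cfdna_ng, target_total_ng=100.0, methylated_fraction=0.20):
    """Amount of filler DNA needed to bring a cfDNA sample up to a target mass.

    Returns the total filler (ng) and its split into methylated and
    unmethylated filler according to `methylated_fraction`.
    """
    if cfdna_ng < 0:
        raise ValueError("cfdna_ng must be non-negative")
    if not 0.0 <= methylated_fraction <= 1.0:
        raise ValueError("methylated_fraction must be between 0 and 1")
    # No filler is needed once the sample alone meets the target.
    filler_ng = max(target_total_ng - cfdna_ng, 0.0)
    methylated_ng = filler_ng * methylated_fraction
    return {
        "filler_ng": filler_ng,
        "methylated_filler_ng": methylated_ng,
        "unmethylated_filler_ng": filler_ng - methylated_ng,
    }


if __name__ == "__main__":
    # A 10 ng cfDNA sample topped up to 100 ng of total DNA.
    print(filler_plan(10.0))
```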

In some embodiments, the filler DNA is 50 bp to 800 bp long, preferably 100 bp to 600 bp long, and more preferably 200 bp to 600 bp long.

In some embodiments, the filler DNA is double stranded. For example, the filler DNA can be junk DNA. The filler DNA may also be endogenous or exogenous DNA. For example, the filler DNA is non-human DNA, and in preferred embodiments, λ DNA. As used herein, “λ DNA” refers to Enterobacteria phage λ DNA. In some embodiments, the filler DNA has no alignment to human DNA.

In some embodiments, the binder is a protein comprising a Methyl-CpG-binding domain. One such exemplary protein is MBD2 protein. As used herein, “Methyl-CpG-binding domain (MBD)” refers to certain domains of proteins and enzymes that are approximately 70 residues long and bind to DNA that contains one or more symmetrically methylated CpGs. The MBD of MeCP2, MBD1, MBD2, MBD4 and BAZ2 mediates binding to DNA, and in cases of MeCP2, MBD1 and MBD2, preferentially to methylated CpG. Human proteins MECP2, MBD1, MBD2, MBD3, and MBD4 comprise a family of nuclear proteins related by the presence in each of a methyl-CpG-binding domain (MBD). Each of these proteins, with the exception of MBD3, is capable of binding specifically to methylated DNA.

In other embodiments, the binder is an antibody and capturing cell-free methylated DNA comprises immunoprecipitating the cell-free methylated DNA using the antibody. As used herein, “immunoprecipitation” refers to a technique of precipitating an antigen (such as polypeptides and nucleotides) out of solution using an antibody that specifically binds to that particular antigen. This process can be used to isolate and concentrate a particular protein or DNA from a sample and requires that the antibody be coupled to a solid substrate at some point in the procedure. The solid substrate includes, for example, beads, such as magnetic beads. Other types of beads and solid substrates are known in the art.

One exemplary antibody is 5-MeC antibody. For the immunoprecipitation procedure, in some embodiments at least 0.05 μg of the antibody is added to the sample; while in more preferred embodiments at least 0.16 μg of the antibody is added to the sample. To confirm the immunoprecipitation reaction, in some embodiments the method described herein further comprises the step of adding a second amount of control DNA to the sample.

In some embodiments, the method further comprises the step of adding a second amount of control DNA to the sample for confirming the immunoprecipitation reaction.

As used herein, the “control” may comprise both positive and negative control, or at least a positive control.

In some embodiments, the method further comprises the step of adding a second amount of control DNA to the sample for confirming the capture of cell-free methylated DNA.

In some embodiments, identifying the presence of DNA from cancer cells further includes identifying the cancer cell tissue of origin.

In some instances, tumor tissue sampling may be challenging or carry significant risks, in which case diagnosing and/or subtyping the cancer without the need for tumor tissue sampling may be desired. For example, lung tumor tissue sampling may require invasive procedures such as mediastinoscopy, thoracotomy, or percutaneous needle biopsy; these procedures may result in a need for hospitalization, chest tube, mechanical ventilation, antibiotics, or other medical interventions. Some individuals may not undergo the invasive procedures needed for tumor tissue sampling either because of medical comorbidities or due to preference. In some instances, the actual procedure for tumor tissue procurement may depend on the suspected cancer subtype. In other instances, cancer subtype may evolve over time within the same individual; serial assessment with invasive tumor tissue sampling procedures is often impractical and not well tolerated by patients. Thus, non-invasive cancer subtyping via blood test could have many advantageous applications in the practice of clinical oncology.

Accordingly, in some embodiments, identifying the cancer cell tissue of origin further includes identifying a cancer subtype. Preferably, the cancer subtype differentiates the cancer based on stage (e.g., early stage lung cancer treated with surgery vs late stage lung cancer treated with chemotherapy), histology (e.g., small cell carcinoma vs adenocarcinoma vs squamous cell carcinoma in lung cancer), gene expression pattern or transcription factor activity (e.g., ER status in breast cancer), copy number aberrations (e.g., HER2 status in breast cancer), specific rearrangements (e.g., FLT3 in AML), specific gene point mutational status (e.g., IDH gene point mutations), and DNA methylation patterns (e.g., MGMT gene promoter methylation in brain cancer).

In some embodiments, the comparison in step (f) is carried out genome-wide.

In other embodiments, the comparison in step (f) is restricted from genome-wide to specific regulatory regions, such as, but not limited to, FANTOM5 enhancers, CpG Islands, CpG shores, CpG Shelves, or any combination of the foregoing.
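
Restricting the comparison to regulatory regions amounts to an interval-overlap filter, sketched below. The sketch assumes genomic windows and regulatory regions are given as (chromosome, start, end) tuples with half-open coordinates; the coordinates shown are made up, and a real analysis would load annotations such as CpG islands or FANTOM5 enhancers from files.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def restrict_to_regions(windows, regulatory_regions):
    """Keep only the genomic windows that overlap any regulatory region.

    This is a simple quadratic scan, adequate for a sketch; sorted intervals
    or an interval tree would be preferable genome-wide.
    """
    return [w for w in windows if any(overlaps(w, r) for r in regulatory_regions)]


if __name__ == "__main__":
    windows = [("chr1", 0, 300), ("chr1", 300, 600), ("chr2", 0, 300)]
    cpg_islands = [("chr1", 250, 320)]  # hypothetical annotation
    print(restrict_to_regions(windows, cpg_islands))
```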

In some embodiments, certain steps are carried out by a computer processor.

In an aspect, there is provided a method of detecting the presence of DNA from cancer cells and identifying a cancer subtype, the method comprising: receiving sequencing data of cell-free methylated DNA from a subject sample; comparing the sequences of the captured cell-free methylated DNA to control cell-free methylated DNA sequences from healthy and cancerous individuals; identifying the presence of DNA from cancer cells if there is a statistically significant similarity between one or more sequences of the captured cell-free methylated DNA and cell-free methylated DNA sequences from cancerous individuals; and if DNA from cancer cells is identified, further identifying the cancer cell tissue of origin and cancer subtype based on the comparison step.

In an aspect, there is provided a method of detecting the presence of DNA from cancer cells and determining the location of the cancer from which the cancer cells arose from two or more possible organs, the method comprising: providing a sample of cell-free DNA from a subject; capturing cell-free methylated DNA from said sample, using a binder selective for methylated polynucleotides; sequencing the captured cell-free methylated DNA; comparing the sequence patterns of the captured cell-free methylated DNA to DNA sequence patterns of two or more populations of control individuals, each of said two or more populations having localized cancer in a different organ; and determining from which organ the cancer cells arose on the basis of a statistically significant similarity between the pattern of methylation of the cell-free DNA and one of said two or more populations.
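
One simple way to picture the organ determination step is a nearest-profile comparison, sketched below. It assumes each control population is summarized by a mean methylation profile over the same ordered set of regions as the sample and uses Pearson correlation as the similarity measure. The sketch does not perform the statistical significance assessment recited above, and the profiles and names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no pattern to correlate
    return cov / (sx * sy)


def most_similar_population(sample_profile, population_profiles):
    """Return the population whose mean profile correlates best with the sample,
    together with the correlation against every population."""
    scores = {name: pearson(sample_profile, profile)
              for name, profile in population_profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    populations = {
        "lung": [5.0, 1.0, 0.0, 4.0],
        "colon": [0.0, 4.0, 5.0, 1.0],
    }
    sample = [4.0, 1.0, 1.0, 5.0]
    print(most_similar_population(sample, populations))
```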

The present system and method may be practiced in various embodiments. A suitably configured computer device, and associated communications networks, devices, software and firmware may provide a platform for enabling one or more embodiments as described above. By way of example, FIG. 5 shows a generic computer device 100 that may include a central processing unit (“CPU”) 102 connected to a storage unit 104 and to a random access memory 106. The CPU 102 may process an operating system 101, application program 103, and data 123. The operating system 101, application program 103, and data 123 may be stored in storage unit 104 and loaded into memory 106, as may be required. Computer device 100 may further include a graphics processing unit (GPU) 122 which is operatively connected to CPU 102 and to memory 106 to offload intensive image processing calculations from CPU 102 and run these calculations in parallel with CPU 102. An operator 107 may interact with the computer device 100 using a video display 108 connected by a video interface 105, and various input/output devices such as a keyboard 115, mouse 112, and disk drive or solid state drive 114 connected by an I/O interface 109. In known manner, the mouse 112 may be configured to control movement of a cursor in the video display 108, and to operate various graphical user interface (GUI) controls appearing in the video display 108 with a mouse button. The disk drive or solid state drive 114 may be configured to accept computer readable media 116. The computer device 100 may form part of a network via a network interface 111, allowing the computer device 100 to communicate with other suitably configured data processing systems (not shown). One or more different types of sensors 135 may be used to receive input from various sources.

The present system and method may be practiced on virtually any manner of computer device including a desktop computer, laptop computer, tablet computer or wireless handheld. The present system and method may also be implemented as a computer-readable/useable medium that includes computer program code to enable one or more computer devices to implement each of the various process steps in a method in accordance with the present invention. Where more than one computer device performs the entire operation, the computer devices are networked to distribute the various steps of the operation. It is understood that the terms computer-readable medium or computer useable medium comprise one or more of any type of physical embodiment of the program code. In particular, the computer-readable/useable medium can comprise program code embodied on one or more portable storage articles of manufacture (e.g. an optical disc, a magnetic disk, a tape, etc.), or on one or more data storage portions of a computing device, such as memory associated with a computer and/or a storage system.

In an aspect, there is provided a computer-implemented method of detecting the presence of DNA from cancer cells and identifying a cancer subtype, the method comprising: receiving, at at least one processor, sequencing data of cell-free methylated DNA from a subject sample; comparing, at the at least one processor, the sequences of the captured cell-free methylated DNA to control cell-free methylated DNA sequences from healthy and cancerous individuals; identifying, at the at least one processor, the presence of DNA from cancer cells if there is a statistically significant similarity between one or more sequences of the captured cell-free methylated DNA and cell-free methylated DNA sequences from cancerous individuals; and, if DNA from cancer cells is identified, further identifying the cancer cell tissue of origin and cancer subtype based on the comparison step.

In an aspect, there is provided a computer program product for use in conjunction with a general-purpose computer having a processor and a memory connected to the processor, the computer program product comprising a computer readable storage medium having a computer program mechanism encoded thereon, wherein the computer program mechanism may be loaded into the memory of the computer and cause the computer to carry out the method described herein.

In an aspect, there is provided a computer readable medium having stored thereon a data structure for storing the computer program product described herein.

In an aspect, there is provided a device for detecting the presence of DNA from cancer cells and identifying a cancer subtype, the device comprising: at least one processor; and electronic memory in communication with the at least one processor, the electronic memory storing processor-executable code that, when executed at the at least one processor, causes the at least one processor to: receive sequencing data of cell-free methylated DNA from a subject sample; compare the sequences of the captured cell-free methylated DNA to control cell-free methylated DNA sequences from healthy and cancerous individuals; identify the presence of DNA from cancer cells if there is a statistically significant similarity between one or more sequences of the captured cell-free methylated DNA and cell-free methylated DNA sequences from cancerous individuals; and, if DNA from cancer cells is identified, further identify the cancer cell tissue of origin and cancer subtype based on the comparison step.

As used herein, “processor” may be any type of processor, such as, for example, any type of general-purpose microprocessor or microcontroller (e.g., an Intel™ x86, PowerPC™, ARM™ processor, or the like), a digital signal processing (DSP) processor, an integrated circuit, a field programmable gate array (FPGA), or any combination thereof.

As used herein, “memory” may include a suitable combination of any type of computer memory that is located either internally or externally such as, for example, random-access memory (RAM), read-only memory (ROM), compact disc read-only memory (CDROM), electro-optical memory, magneto-optical memory, erasable programmable read-only memory (EPROM), and electrically-erasable programmable read-only memory (EEPROM), or the like. Portions of memory 102 may be organized using a conventional filesystem, controlled and administered by an operating system governing overall operation of a device.

As used herein, “computer readable storage medium” (also referred to as a machine-readable medium, a processor-readable medium, or a computer usable medium having a computer-readable program code embodied therein) is a medium capable of storing data in a format readable by a computer or machine. The machine-readable medium can be any suitable tangible, non-transitory medium, including magnetic, optical, or electrical storage medium including a diskette, compact disk read only memory (CD-ROM), memory device (volatile or non-volatile), or similar storage mechanism. The computer readable storage medium can contain various sets of instructions, code sequences, configuration information, or other data, which, when executed, cause a processor to perform steps in a method according to an embodiment of the disclosure. Those of ordinary skill in the art will appreciate that other instructions and operations necessary to implement the described implementations can also be stored on the computer readable storage medium. The instructions stored on the computer readable storage medium can be executed by a processor or other suitable processing device, and can interface with circuitry to perform the described tasks.

As used herein, “data structure” is a particular way of organizing data in a computer so that it can be used efficiently. Data structures can implement one or more particular abstract data types (ADT), which specify the operations that can be performed on a data structure and the computational complexity of those operations. In comparison, a data structure is a concrete implementation of the specification provided by an ADT.

The advantages of the present invention are further illustrated by the following examples. The examples and their particular details set forth herein are presented for illustration only and should not be construed as a limitation on the claims of the present invention.

Examples

Methods and Materials

Donor Recruitment and Sample Acquisition

CRC, breast cancer, and GBM samples were obtained from the University Health Network BioBank; AML samples were obtained from the University Health Network Leukemia BioBank; healthy controls were recruited through the Family Medicine Centre at Mount Sinai Hospital (MSH) in Toronto, Canada. All samples were collected with patient consent and were obtained with institutional approval from the Research Ethics Board at University Health Network and Mount Sinai Hospital in Toronto, Canada.

Specimen Processing—cfDNA

EDTA and ACD plasma samples were obtained from the BioBanks and from the Family Medicine Centre at Mount Sinai Hospital (MSH) in Toronto, Canada. All samples were stored either at −80° C. or in vapour phase liquid nitrogen until use. Cell-free DNA was extracted from 0.5-3.5 ml of plasma using the QIAamp Circulating Nucleic Acid Kit (Qiagen). The extracted DNA was quantified through Qubit prior to use.

Specimen Processing—PDX cfDNA

Human colorectal tumor tissue, obtained with patient consent from the University Health Network Biobank as approved by the Research Ethics Board at University Health Network, was digested to single cells using collagenase A. Single cells were subcutaneously injected into 4-6 week old NOD/SCID male mice. Mice were euthanized by CO2 inhalation prior to blood collection by cardiac puncture, and the blood was stored in EDTA tubes. From the collected blood samples, the plasma was isolated and stored at −80° C. Cell-free DNA was extracted from 0.3-0.7 ml of plasma using the QIAamp Circulating Nucleic Acid Kit (Qiagen). All animal work was carried out in compliance with the ethical regulations approved by the Animal Care Committee at University Health Network.

cfMeDIP-seq

A schematic representation of the cfMeDIP-seq protocol is shown in WO2017/190215. Prior to cfMeDIP, the DNA samples were subjected to library preparation using the Kapa Hyper Prep Kit (Kapa Biosystems). The manufacturer protocol was followed with some modifications. Briefly, the DNA of interest was added to a 0.2 mL PCR tube and subjected to end-repair and A-tailing. Adapter ligation followed, using NEBNext adapter (from the NEBNext Multiplex Oligos for Illumina kit, New England Biolabs) at a final concentration of 0.181 μM, with incubation at 20° C. for 20 mins and purification with AMPure XP beads. The eluted library was digested using the USER enzyme (New England Biolabs Canada) followed by purification with the Qiagen MinElute PCR Purification Kit prior to MeDIP.

The prepared libraries were combined with the pooled methylated/unmethylated PCR product to a final DNA amount of 100 ng and subjected to MeDIP using the protocol from Taiwo et al. 2012[7] with some modifications. Briefly, for MeDIP, the Diagenode MagMeDIP kit (Cat #C02010021) was used following the manufacturer's protocol with some modifications. After the addition of 0.3 ng of the control methylated and 0.3 ng of the control unmethylated A. thaliana DNA, the filler DNA (to complete the total amount of DNA [cfDNA+Filler+Controls] to 100 ng) and the buffers to the PCR tubes containing the adapter ligated DNA, the samples were heated to 95° C. for 10 mins, then immediately placed into an ice water bath for 10 mins. Each sample was partitioned into two 0.2 mL PCR tubes: one for the 10% input control and the other one for the sample to be subjected to immunoprecipitation. The included 5-mC monoclonal antibody 33D3 (Cat #C15200081) from the MagMeDIP kit was diluted 1:15 prior to generating the diluted antibody mix and added to the sample. Washed magnetic beads (following manufacturer instructions) were also added prior to incubation at 4° C. for 17 hours. The samples were purified using the Diagenode iPure Kit and eluted in 50 μl of Buffer C. The success of the reaction (QC1) was validated through qPCR to detect the presence of the spiked-in A. thaliana DNA, ensuring a % recovery of unmethylated spiked-in DNA <1% and the % specificity of the reaction >99% (as calculated by 1−[recovery of spiked-in unmethylated control DNA over recovery of spiked-in methylated control DNA]), prior to proceeding to the next step. The optimal number of cycles to amplify each library was determined through the use of qPCR, after which the samples were amplified using the KAPA HiFi Hotstart Mastermix and the NEBNext multiplex oligos added to a final concentration of 0.3 μM. The PCR settings used to amplify the libraries were as follows: activation at 95° C. for 3 min, followed by predetermined cycles of 98° C. for 20 sec, 65° C. for 15 sec and 72° C. for 30 sec and a final extension of 72° C. for 1 min. The amplified libraries were purified using a MinElute PCR purification column and then gel size selected with 3% Nusieve GTG agarose gel to remove any adapter dimers. Prior to submission for sequencing, the fold enrichment of a methylated human DNA region (testis-specific H2B, TSH2B) and an unmethylated human DNA region (GAPDH promoter) was determined for the MeDIP-seq and cfMeDIP-seq libraries generated from the HCT116 cell line DNA sheared to mimic cell free DNA (cell line obtained from ATCC, mycoplasma free). The final libraries were submitted for BioAnalyzer analysis prior to sequencing at the UHN Princess Margaret Genomic Centre on an Illumina HiSeq 2000.
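The QC1 acceptance criteria described above (recovery of unmethylated spike-in below 1% and specificity above 99%) can be sketched as a small calculation. This is an illustrative sketch only: the function names are hypothetical, and the recoveries are assumed to have already been derived from the qPCR data as percentages; how they are derived from Ct values is not specified here.

```python
def percent_specificity(unmeth_recovery: float, meth_recovery: float) -> float:
    """Specificity in percent: 1 - (unmethylated recovery / methylated recovery).

    Both recoveries must be expressed in the same unit (here, percent of
    the spiked-in control recovered after immunoprecipitation).
    """
    if meth_recovery <= 0:
        raise ValueError("meth_recovery must be positive")
    return 100.0 * (1.0 - unmeth_recovery / meth_recovery)


def passes_qc1(unmeth_recovery: float, meth_recovery: float) -> bool:
    """Apply the two thresholds stated in the text: <1% and >99%."""
    specificity = percent_specificity(unmeth_recovery, meth_recovery)
    return unmeth_recovery < 1.0 and specificity > 99.0


if __name__ == "__main__":
    # Hypothetical recoveries, for illustration only.
    print(percent_specificity(0.1, 50.0))
    print(passes_qc1(0.1, 50.0))
```

Keeping the two thresholds in one predicate makes it harder to proceed to amplification when only one of the criteria is met.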

Ultra-Deep Targeted Sequencing for Point Mutation Detection

We used the QIAGEN Circulating Nucleic Acid kit to isolate cell-free DNA from ˜20 mL of plasma (4-5×10 mL EDTA blood tubes) from patients with matched tumor tissue molecular profiling data generated prior to enrolment in early phase clinical trials at the Princess Margaret Cancer Centre. DNA was extracted from cell lines (dilution of CRC and MM cell lines) using the PureGene Gentra kit, fragmented to ˜180 bp using a Covaris sonicator, and larger size fragments were excluded using Ampure beads to mimic the fragment size of cell-free DNA. DNA sequencing libraries were constructed from 83 ng of fragmented DNA using the KAPA Hyper Prep Kit (Kapa Biosystems, Wilmington, Mass.) with NEXTflex-96 DNA Barcode adapters (Bioo Scientific, Austin, Tex.). To isolate DNA fragments containing known mutations, we designed biotinylated DNA capture probes (xGen Lockdown Custom Probes Mini Pool, Integrated DNA Technologies, Coralville, Iowa) targeting mutation hotspots from 48 genes tested by the clinical laboratory using the Illumina TruSeq Amplicon Cancer Panel. The barcoded libraries were pooled and then applied to the custom hybrid capture library following the manufacturer's instructions (IDT xGEN Lockdown protocol version 2.1). These fragments were sequenced to >10,000× read coverage using an Illumina HiSeq 2000 instrument. Resulting reads were aligned using bwa-mem and mutations were detected using samtools and muTect version 1.1.4.

Modelling Relationships Between Number of Tumor-Specific Features and Probability of Detection by Sequencing Depth

We created 145,000 simulated genomes, with the proportion of cancer-specific methylated DMRs set to 0.001%, 0.01%, 0.1%, 1%, and 10% and consisting of 1, 10, 100, 1000 and 10000 independent DMRs respectively. We sampled 14,500 diploid genomes (representing 100 ng of DNA) from these original mixtures and further sampled 10, 100, 1000, and 10000 reads per locus to represent sequencing coverage at those depths. This process was repeated 100 times for each combination of coverage, abundance, and number of features. We estimated the frequency of successful detection of at least 1 DMR for each combination of parameters and plotted probability curves (FIG. 1A) to visually evaluate the influence of the number of features on the probability of successful detection conditional on sequencing depths.
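A simulation of this kind can be sketched as a two-stage Monte Carlo procedure. The sketch below is not the original simulation code; it assumes that each sampled genome contributes one copy per DMR, that DMRs are independent, and that a DMR counts as detected when at least one read at that locus is tumour-derived. All function and parameter names are hypothetical.

```python
import math
import random


def binomial_draw(rng: random.Random, n: int, p: float) -> int:
    """Draw from Binomial(n, p) by skipping failures with geometric gaps."""
    if p <= 0.0:
        return 0
    if p >= 1.0:
        return n
    log_q = math.log1p(-p)
    successes, position = 0, 0
    while True:
        # Number of failures before the next success (geometric variate).
        gap = int(math.log(1.0 - rng.random()) / log_q)
        position += gap + 1
        if position > n:
            return successes
        successes += 1


def detection_probability(fraction, n_dmrs, depth,
                          n_genomes=14_500, n_repeats=100, seed=0):
    """Estimate P(at least one DMR yields a tumour-derived read)."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_repeats):
        hit = False
        for _ in range(n_dmrs):
            # Stage 1: tumour-derived copies among the sampled genomes.
            tumour_copies = binomial_draw(rng, n_genomes, fraction)
            if tumour_copies == 0:
                continue
            # Stage 2: chance that at least one of `depth` reads is tumour-derived.
            p_locus = 1.0 - (1.0 - tumour_copies / n_genomes) ** depth
            if rng.random() < p_locus:
                hit = True
                break
        detected += hit
    return detected / n_repeats


if __name__ == "__main__":
    for n_dmrs in (1, 10, 100):
        print(n_dmrs, detection_probability(0.0001, n_dmrs, depth=100))
```

Running the loop over a grid of fractions, DMR counts, and depths would give the points for probability curves of the kind shown in FIG. 1A, under the stated assumptions.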

Derivation of Tissue-Distinctive Features, Development of a Multi-Tissue Classifier and Validation in 450 k Data

cfDNA MeDIP profiles were quantified using the MEDIPS R package[8], converted to RPKMs, and afterwards transformed into log2 counts-per-million. Subsequently, a linear model was fit using limma-trend[9] on a matrix of features that mapped to FANTOM5 enhancers, CpG Islands, CpG shores and CpG Shelves, with the percentage of spike-in methylated DNA recovered included as a covariate to control for pulldown efficiency variation. Pairwise contrasts were evaluated for each pair of tissue types and the top 150 and the bottom 150 DMRs were selected for elastic net classifier training and validation of cancer-type specificity. Performance metrics were derived by majority class votes on out-of-fold calls from the model with the highest Kappa value in cross-validation, a heuristic previously employed in Chakravarthy et al[10].
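For readers unfamiliar with the normalisations named above, the standard definitions of RPKM and log2 counts-per-million can be sketched as follows. This is a generic illustration, not the MEDIPS or limma implementation; the window length, the pseudocount `prior`, and the function names are assumptions introduced here.

```python
import math


def rpkm(count: float, window_bp: int, total_reads: int) -> float:
    """Reads per kilobase of window per million mapped reads."""
    return count * 1e9 / (window_bp * total_reads)


def log2_cpm(count: float, total_reads: int, prior: float = 0.5) -> float:
    """log2 counts-per-million with a small pseudocount (assumed value)
    so that windows with zero reads remain finite."""
    return math.log2((count + prior) * 1e6 / total_reads)


if __name__ == "__main__":
    # Hypothetical window: 30 reads in a 300 bp window, 1 million mapped reads.
    print(rpkm(30, 300, 1_000_000))
    print(log2_cpm(30, 1_000_000))
```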

Machine learning analyses for evaluation of classification accuracy

Model Training and Evaluation on the Discovery Cohort

In order to evaluate the performance of cfMeDIP data in tumor classification without high computational cost, we reduced the initial set of possible candidate features to windows encompassing CpG Islands, shores, shelves and FANTOM5 enhancers (hereafter labelled “regulatory features”), yielding a matrix of 196 samples and 505,027 features. We then used the caret R package to partition the discovery cohort data into 50 independent training and test sets in an 80%-20% manner (FIG. 2A). The splits were performed while class proportions across the discovery cohort were maintained. Then, we selected the top 300 DMRs by moderated t-statistic (150 hypermethylated, 150 hypomethylated) on the training data partition using limma-trend for each class versus other classes. A binomial GLMnet was then trained using these DMRs (up to 300 DMRs×7 other classes=2100 features) with the use of 3 iterations of 10-Fold Cross-Validation (CV) to optimize values of the mixing parameter (alpha, values=0, 0.2, 0.5, 0.8 and 1) and the penalty (lambda, values=0-0.05 in increments of 0.01) using Cohen's Kappa as the performance metric. For each training set, this yielded a collection of 6 one-class-vs-other-classes binomial classifiers.
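The partitioning scheme described above (50 class-balanced 80%-20% splits) and the tuning grid can be sketched in a few lines. The analysis itself used the caret R package; the Python below is a stand-in written for illustration, with hypothetical function names, and it does not reproduce caret's exact sampling behaviour.

```python
import itertools
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split sample indices into train/test while keeping class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_train = round(train_fraction * len(members))
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return sorted(train), sorted(test)


def repeated_splits(labels, n_splits=50, train_fraction=0.8, seed=0):
    """Generate independent stratified splits, one per seed."""
    return [stratified_split(labels, train_fraction, seed + i)
            for i in range(n_splits)]


# Tuning grid as stated in the text: alpha in {0, 0.2, 0.5, 0.8, 1},
# lambda from 0 to 0.05 in increments of 0.01.
ALPHAS = [0, 0.2, 0.5, 0.8, 1]
LAMBDAS = [i / 100 for i in range(6)]
TUNING_GRID = list(itertools.product(ALPHAS, LAMBDAS))

if __name__ == "__main__":
    demo_labels = ["AML"] * 10 + ["normal"] * 20  # hypothetical cohort
    train_idx, test_idx = repeated_splits(demo_labels)[0]
    print(len(train_idx), len(test_idx), len(TUNING_GRID))
```

DMR selection and classifier training would then be run inside each training partition only, so that the corresponding test partition never influences feature selection or tuning.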

We then estimated classification performance on the held-out test set using the AUROC (area under the receiver operating characteristic curve). These estimates represent unbiased measures of classification, as the held-out test set samples were not used for either DMR pre-selection or GLMnet training and tuning. The use of 50 independent training and test sets also reduced the risk of optimistic estimates due to training-set bias.

Model Evaluation on the Validation Cohort

For each validation cohort cfMeDIP sample, we estimated class probabilities for the AML, LUC and normal one-vs-all binomial classifiers trained on the 50 different training sets within the discovery cohort. The probabilities from the 50 models were averaged to produce a single score that was then used for AUROC estimation. We also evaluated if disease stage affected performance by estimating AUROC when either early (Stages I and II) or late stage LUC samples (Stages III and IV) were left out for the one-vs-all classifier.
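The scoring step described above, averaging class probabilities across models and then estimating AUROC, can be sketched as follows. This is a minimal illustration using the rank-based (Mann-Whitney) definition of AUROC; the function names and the toy numbers are hypothetical and are not taken from the study data.

```python
def average_scores(per_model_scores):
    """Average per-sample class probabilities across an ensemble of models.

    per_model_scores[m][i] is the probability assigned by model m to sample i.
    """
    n_models = len(per_model_scores)
    n_samples = len(per_model_scores[0])
    return [sum(model[i] for model in per_model_scores) / n_models
            for i in range(n_samples)]


def auroc(scores, is_positive):
    """AUROC as the probability that a positive sample outranks a negative one
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Two hypothetical models scoring four samples for one class.
    ensemble = [[0.9, 0.7, 0.2, 0.1], [0.8, 0.6, 0.4, 0.3]]
    truth = [True, True, False, False]
    print(auroc(average_scores(ensemble), truth))
```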

Results and Discussion

We bioinformatically simulated mixtures with different proportions of ctDNA, from 0.001% to 10% (FIG. 1A, column facets). We also simulated scenarios where the ctDNA had 1, 10, 100, 1000, or 10000 DMRs (Differentially Methylated Regions) as compared to normal cfDNA (FIG. 1A, row facets). Reads were then sampled at varying sequencing depths at each locus (10×, 100×, 1000×, and 10000×) (FIG. 1A, x-axis). We found an increasing probability of detecting at least 1 cancer-specific event (FIG. 1A) as the number of DMRs increased, even at low abundance of ctDNA and shallow coverage.

Moreover, pan-cancer data from The Cancer Genome Atlas (TCGA) shows large numbers of DMRs between tumor and normal tissues across virtually all tumor types[11]. Therefore, these findings highlighted that an assay that successfully recovered cancer-specific DNA methylation alterations from ctDNA could serve as a very sensitive tool to detect, classify, and monitor malignant disease with low sequencing-associated costs.

However, genome-wide mapping of DNA methylation in plasma cfDNA is challenging due to the very low quantities and fragmentation of DNA in circulation[12]. As a result, previous efforts at methylation profiling of cfDNA have mainly been restricted to locus-specific PCR-based assays[2, 3], such as an FDA approved SEPT9 methylation assay for colorectal cancer screening[13]. While recent efforts have been made to perform whole-genome bisulfite sequencing (WGBS) of fragmented cfDNA[14-16], the low genome-wide abundance of CpGs is likely to reduce the amount of useful methylation-related information available from sequencing. Therefore, the main issues with WGBS on plasma DNA are the high cost, low efficiency, and DNA losses associated with the bisulfite conversion. On the other hand, a method that selectively enriches for CpG-rich features prone to methylation is likely to maximize the amount of useful information available per read, decrease the cost, and decrease the DNA losses.

A Genome-Wide Method Suitable for cfDNA Methylation Mapping

We developed a new method termed cfMeDIP-seq (cell-free Methylated DNA Immunoprecipitation and high-throughput sequencing) to perform genome-wide DNA methylation mapping using cell-free DNA. The cfMeDIP-seq method described here was developed through the modification of an existing low input MeDIP-seq protocol[7] that in our experience is very robust down to 100 ng of input DNA. However, the majority of plasma samples yield much less than 100 ng of DNA. To overcome this challenge, we added exogenous λ DNA (filler DNA) to the adapter-ligated cfDNA library in order to artificially inflate the amount of starting DNA to 100 ng. This minimizes the amount of non-specific binding by the antibody and also minimizes the amount of DNA lost due to binding to plasticware. The filler DNA consisted of amplicons similar in size to an adapter-ligated cfDNA library and was composed of unmethylated and in vitro methylated DNA at different CpG densities. The addition of this filler DNA also serves a practical use, as different patients will yield different amounts of cfDNA, allowing for the normalization of input DNA amount to 100 ng. This ensures that the downstream protocol remains exactly the same for all samples regardless of the amount of available cfDNA.

We first validated the cfMeDIP-seq protocol using DNA from human colorectal cancer cell line HCT116, sheared to a fragment size similar to that observed in cfDNA. HCT116 was chosen because of the availability of public DNA methylation data. We simultaneously performed the gold standard MeDIP-seq protocol[7] using 100 ng of sheared cell line DNA and the cfMeDIP-seq protocol using 10 ng, 5 ng, and 1 ng of the same sheared cell line DNA. This was performed in two biological replicates. For all the conditions, we obtained more than 99% specificity of the reaction (1−[recovery of spiked-in unmethylated control DNA over recovery of spiked-in methylated control DNA]), and a very high enrichment of a known methylated region over an unmethylated region (TSH2B and GAPDH, respectively) (FIG. 6F).

The libraries were sequenced to saturation (FIGS. 6A-6E) at around 30 to 70 million reads per library (Supplementary Table 1). The raw reads were aligned to both the human genome and the λ genome, and virtually no alignment was found to the λ genome (Supplementary Table 1). Therefore, the addition of the exogenous λ DNA as filler DNA did not interfere with the generation of sequencing data. Finally, we calculated the CpG Enrichment Score as a quality control measure for the immunoprecipitation step[8]. All the libraries showed similar enrichment for CpGs while the input control, as expected, showed no enrichment (FIG. 6G), validating our immunoprecipitations even at extremely low inputs (1 ng).

Genome-wide correlation estimates comparing different input DNA levels show that both MeDIP-seq (100 ng) and cfMeDIP-seq (10, 5, and 1 ng) methods were very robust, with Pearson correlation of at least 0.94 between any two biological replicates (FIG. 1B). The analysis also demonstrates that cfMeDIP-seq at 5 and 10 ng of input DNA can robustly recapitulate the methylation profile obtained by traditional MeDIP-seq at 100 ng (pairwise Pearson correlation of at least 0.9) (FIG. 1B). The performance of cfMeDIP-seq at 1 ng of input DNA is reduced compared to MeDIP-seq at 100 ng but still shows a strong Pearson correlation at >0.7 (FIG. 1B). We also observed that the cfMeDIP-seq protocol recapitulates the DNA methylation profile of HCT116 using gold standard RRBS (Reduced Representation Bisulfite Sequencing) and WGBS (Whole-Genome Bisulfite Sequencing) (FIG. 1C). Altogether, our data suggest that cfMeDIP-seq is a robust protocol for genome-wide methylation mapping of fragmented and low input DNA material, such as circulating cfDNA.

cfMeDIP-Seq Displays High-Sensitivity for Detection of Tumor-Derived ctDNA

To evaluate the sensitivity of the cfMeDIP-seq protocol, we performed a serial dilution of Colorectal Cancer (CRC) HCT116 cell line DNA into a Multiple Myeloma (MM) MM1.S cell line DNA, both sheared to mimic cfDNA sizes. We diluted the CRC DNA from 100%, 10%, 1%, 0.1%, 0.01%, 0.001%, to 0% and performed cfMeDIP-seq on each of these dilutions. We also performed ultra-deep (10,000× median coverage) targeted sequencing for detection of three point mutations in the same samples. The observed number of DMRs identified at each CRC dilution point versus the pure MM DNA using a 5% False Discovery Rate (FDR) threshold was almost perfectly linear (r²=0.99, p<0.0001) with the expected number of DMRs based on the dilution factor (FIG. 1D) down to a 0.001% dilution. Moreover, the DNA methylation signal within these DMRs also shows almost perfect linearity (r²=0.99, p<0.0001) between the observed versus expected signal (FIG. 1E; Supplementary Table 2B). In comparison, beyond the 1% dilution, ultra-deep targeted sequencing could not reliably distinguish between the CRC-specific variants and the spurious variants due to PCR or sequencing errors (FIG. 1F; Supplementary Table 2A). Thus, cfMeDIP-seq displays excellent sensitivity for the detection of cancer-derived DNA, exceeding the performance of variant detection by ultra-deep targeted sequencing using a standard protocol.
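A linearity check of this kind can be sketched from the observed DMR counts reported in Table 2B. The text does not state how the expected values or the regression scale were defined, so the sketch below makes two labelled assumptions: the expected count is the count at 100% scaled by the dilution fraction, and r² is computed on both the raw and the log10 scale. It is not intended to reproduce the reported r² values.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Observed DMR counts per dilution, as listed in Table 2B.
DILUTION_PERCENT = [100, 10, 1, 0.1, 0.01, 0.001]
OBSERVED_DMRS = [111_472, 1_597, 692, 12, 8, 2]

# Assumption: expected count scales linearly with the dilution fraction.
expected_dmrs = [OBSERVED_DMRS[0] * d / 100 for d in DILUTION_PERCENT]

r2_raw = pearson_r(expected_dmrs, OBSERVED_DMRS) ** 2
r2_log = pearson_r([math.log10(v) for v in expected_dmrs],
                   [math.log10(v) for v in OBSERVED_DMRS]) ** 2

if __name__ == "__main__":
    print(f"r^2 (raw scale):   {r2_raw:.4f}")
    print(f"r^2 (log10 scale): {r2_log:.4f}")
```

On the raw scale the 100% point dominates the fit, which is why a log-scale comparison is shown alongside it.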

Cancer DNA is frequently hypermethylated at CpG-rich regions[17]. Since cfMeDIP-seq specifically targets methylated CpG-rich sequences, we hypothesized that ctDNA would be preferentially enriched during the immunoprecipitation procedure. To test this, we generated patient-derived xenografts (PDXs) from two colorectal cancer patients and collected the mouse plasma. Tumor-derived human cfDNA was present at less than 1% frequency within the total cfDNA pool in the input samples and at 2-fold greater abundance following immunoprecipitation (FIG. 1G; Supplementary Table 3). These results suggest that through biased sequencing of ctDNA, the cfMeDIP procedure could further increase ctDNA detection sensitivity.

Circulating Plasma cfDNA Methylation Profile can Distinguish Between Multiple Cancer Types and Healthy Donors

DNA methylation patterns are tissue-specific, and have been used to stratify cancer patients into clinically relevant disease subgroups in glioblastoma[18], ependymomas[6], colorectal[19], and breast[20, 21], among many other cancer types. We asked if cfDNA associated profiles could be used to identify tissues-of-origin for multiple tumor types. To this end, we profiled 196 samples, comprising early and late stage tumors from 5 different tumor types and normal controls. We used linear modeling to identify the top 300 DMRs mapping to CpG shores, shelves, islands and FANTOM5 enhancers for each pairwise comparison, leading to a total of 2,100 unique DMRs (FIG. 2A). Density clustering based on t-Distributed Stochastic Neighbor Embedding (tSNE)[22] of the 196 plasma samples based on the methylation status of these features revealed distinct clustering of samples based on tissue-of-origin and tumor types (FIG. 2B,C). Using an elastic net multi-cancer classifier fit with these features (FIG. 2A), we observed highly accurate discrimination between different tumor types (FIG. 2D).

Discrimination of Disease Subtypes

We evaluated the ability of cfDNA MeDIP profiles to discriminate between disease subtypes in five distinct cases: gene expression pattern (ER status in breast cancer), copy number aberration (HER2 status in breast cancer), rearrangement (FLT3 ITD status in AML), point mutation (IDH mutation in GBM), and finally histology in lung cancer. In each case, linear models were used to select and rank features as described earlier, and hierarchical clustering was used to evaluate the grouping of samples. Density clustering based on t-Distributed Stochastic Neighbor Embedding (tSNE)[22] based on the methylation status of selected features revealed distinct clustering of samples based on each of these five distinct examples of cancer subtype classification.

Detection of Cancers and Classification of Cancer Types Using Machine Learning

In order to rigorously evaluate the ability of cfMeDIP profiles to detect cancers and further classify cancer types, we then conducted a set of machine learning analyses on our discovery cohort. To allow for accelerated computational analysis, we initially reduced our cfMeDIP discovery cohort to features mapping to CpG islands, shores, shelves and FANTOM5 enhancers (n=505,027 windows). We then implemented a strategy on our discovery cohort samples to derive unbiased estimates of performance, while accounting for training-set biases.

Herein, we split the discovery cohort into balanced training and test sets (80% training set, 20% test set). Using only the samples in the training set, we selected the top 300 DMRs for each class (sample type) versus other classes, based on limma-trend test statistics, and trained a series of one-versus-other-classes GLMnets using these features on the training set data. The training procedure consisted of 3 rounds of 10-Fold Cross-Validation (CV) across a grid of values for alpha and lambda with optimisation for Cohen's Kappa. The use of multiple rounds of 10-Fold CV was motivated by a desire to leverage additional randomisation for more generalisable model tuning.

Performance was then evaluated using AUROC (area under the receiver operating characteristic curve) derived from test set samples (held out during the DMR selection and the subsequent GLMnet training/tuning steps). This process was repeated with 50 different splits of the discovery cohort into training and test sets to mitigate the influence of training-set biases. This culminated in a collection of 50 models for each one-vs-other-classes comparison (480 models in total). Hereafter, we refer to this collection of models as E50.

Subsequently, we evaluated performance across batches by generating a validation cohort of an additional 152 plasma samples: AML (n=35), lung cancer (n=55) and healthy control (n=62) samples. For each class, we averaged the class probabilities output by the models in E50, and estimated AUROC for the one class vs. all other classes (FIG. 3A). The classifiers showed high AUROC values for the classification of AML vs others (0.993), LUC vs others (0.943) and normal vs others (1.000). This further confirmed the ability of cfMeDIP-seq coupled with a machine learning approach to accurately detect and classify tumor type. Finally, we observed that the classifiers were as accurate in early stage samples (0.950) as in late stage samples (0.934) (FIG. 3B), suggesting that this approach is applicable to the detection of cancer at both early and late stages.

Additional Advantages of cfDNA Methylome Profiling with cfMeDIP-Seq

The ability of cfDNA methylation patterns to accurately represent tissue-of-origin also overcomes limitations of mutation-based assays, wherein specificity for tissues-of-origin may be low due to the recurrent nature of many potential driver mutations across cancers in different tissues[23]. Mutation-based assays may also be rendered insensitive by the clonal structure of tumors, where subclonal drivers may be harder to detect by virtue of lower abundance in ctDNA[24]. Mutation-based ctDNA approaches are also vulnerable to potential confounding by driver mutations in benign tissues, which have been observed[25], and documented to display evidence of positive selection[26].

Taken together, our findings, based on the largest collection of cancer cfDNA methylomes derived to date, establish cfMeDIP-seq as an efficient and cost-effective tool with the potential to influence management of cancer and early detection. The accuracy and versatility of cfMeDIP-seq may be useful to inform therapeutic decisions in settings where resistance is correlated to epigenetic alterations, such as sensitivity to androgen receptor inhibition in prostate cancer[27]. The potential opportunities for early diagnosis and screening may be particularly evident in lung cancer, a disease in which screening has already shown clinical utility but for which existing screening tests (i.e., low dose CT scanning) have significant limitations such as ionizing radiation exposure and high false positive rate.

In conclusion, our findings underscore the utility of cfDNA methylation profiles as a basis for non-invasive, cost-effective, sensitive, highly accurate early tumor detection, multi-cancer classification, and cancer subtype classification.

TABLE 1 Number of reads and mapping efficiency of sequenced MeDIP-seq (100 ng, Rep 1 and Rep 2) and cfMeDIP-seq (10 ng, 5 ng and 1 ng, Rep 1 and Rep 2) libraries prepared using various starting inputs of HCT116 cell line DNA sheared to mimic cfDNA, aligned to the human (Hg19) genome and the λ genome. Two biological replicates were used for each starting input DNA amount. For starting inputs less than 100 ng, the samples were topped up with exogenous λ DNA to artificially increase the starting amount to 100 ng prior to MeDIP.

Sample | # of raw reads | # of aligned reads to human genome (Hg19) | Mapping efficiency to human genome (Hg19) (%) | # of aligned reads to λ genome | Mapping efficiency to λ genome (%)
Input | 74,504,053 | 71,343,168 | 95.76 | 12 | 0.00
100 ng Replicate 1 | 55,396,238 | 50,472,273 | 91.11 | 0 | 0.00
100 ng Replicate 2 | 66,569,209 | 60,770,277 | 91.29 | 1 | 0.00
10 ng Replicate 1 | 70,054,607 | 64,020,441 | 91.39 | 0 | 0.00
10 ng Replicate 2 | 58,297,539 | 53,308,777 | 91.44 | 0 | 0.00
5 ng Replicate 1 | 65,845,430 | 60,540,743 | 91.94 | 1 | 0.00
5 ng Replicate 2 | 64,750,879 | 59,358,412 | 91.67 | 0 | 0.00
1 ng Replicate 1 | 35,102,361 | 32,258,451 | 91.90 | 0 | 0.00
1 ng Replicate 2 | 33,881,118 | 31,194,711 | 92.07 | 0 | 0.00

TABLE 2A Mean coverage of ultra-deep targeted variant sequencing using a dilution series of CRC cell line HCT116 DNA into MM cell line MM1.S DNA. SSCS: single strand consensus sequences; DCS: duplex consensus sequences.

Dilution        Uncollapsed reads     SSCS mean        DCS mean
(% of CRC DNA)  mean target coverage  target coverage  target coverage
100             155,964               4284             655
10              154,657               4877             654
1               154,419               4890             654
0.1             183,271               5674             887
0.01            238,291               8068             1602
0.001           199,766               7337             1299
0.0001          187,695               6891             1181
0               216,434               7721             1412

TABLE 2B Resultant observed DMRs and DNA methylation signal from the dilution series of CRC cell line HCT116 DNA into MM cell line MM1.S DNA

Dilution        Observed number  Observed DNA methylation signal
(% of CRC DNA)  of DMRs          (sum of RPKMs within DMRs)
100             111,472          645,683.90
10              1,597            8,775.61
1               692              4,521.60
0.1             12               75.71
0.01            8                79.73
0.001           2                22.42

TABLE 3 Number of reads and mapping efficiency of cfMeDIP-seq libraries of PDX and Input Control samples after aligning to human (Hg19) genome

Sample           # of raw reads  # of aligned reads to  Mapping efficiency to
                                 human genome (Hg19)    human genome (%)
Input Control 1  45,857,633      389,073                0.83
Input Control 2  35,658,454      283,799                0.80
PDX 1            49,997,949      1,080,277              2.16
PDX 2            34,802,767      614,988                1.77

Although preferred embodiments of the invention have been described herein, it will be understood by those skilled in the art that variations may be made thereto without departing from the spirit of the invention or the scope of the appended claims. All documents disclosed herein, including those in the following reference list, are incorporated by reference.

REFERENCE LIST

1. Diaz, L. A., Jr. and A. Bardelli, Liquid biopsies: genotyping circulating tumor DNA. J Clin Oncol, 2014. 32(6): p. 579-86.
2. Lehmann-Werman, R., et al., Identification of tissue-specific cell death using methylation patterns of circulating DNA. Proc Natl Acad Sci USA, 2016. 113(13): p. E1826-34.
3. Visvanathan, K., et al., Monitoring of Serum DNA Methylation as an Early Independent Marker of Response and Survival in Metastatic Breast Cancer: TBCRC 005 Prospective Biomarker Study. J Clin Oncol, 2016: p. JCO2015662080.
4. Newman, A. M., et al., An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage. Nat Med, 2014. 20(5): p. 548-54.
5. Aravanis, A. M., M. Lee, and R. D. Klausner, Next-Generation Sequencing of Circulating Tumor DNA for Early Cancer Detection. Cell, 2017. 168(4): p. 571-574.
6. Mack, S. C., et al., Epigenomic alterations define lethal CIMP-positive ependymomas of infancy. Nature, 2014. 506(7489): p. 445-50.
7. Taiwo, O., et al., Methylome analysis using MeDIP-seq with low DNA concentrations. Nat Protoc, 2012. 7(4): p. 617-36.
8. Lienhard, M., et al., MEDIPS: genome-wide differential coverage analysis of sequencing data derived from DNA enrichment experiments. Bioinformatics, 2014. 30(2): p. 284-6.
9. Law, C. W., et al., voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol, 2014. 15(2): p. R29.
10. Chakravarthy, A., et al., Human Papillomavirus Drives Tumor Development Throughout the Head and Neck: Improved Prognosis Is Associated With an Immune Response Largely Restricted to the Oropharynx. J Clin Oncol, 2016. 34(34): p. 4132-4141.
11. Hoadley, K. A., et al., Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell, 2014. 158(4): p. 929-44.
12. Fleischhacker, M. and B. Schmidt, Circulating nucleic acids (CNAs) and cancer - a survey. Biochim Biophys Acta, 2007. 1775(1): p. 181-232.
13. Potter, N. T., et al., Validation of a real-time PCR-based qualitative assay for the detection of methylated SEPT9 DNA in human plasma. Clin Chem, 2014. 60(9): p. 1183-91.
14. Legendre, C., et al., Whole-genome bisulfite sequencing of cell free DNA identifies signature associated with metastatic breast cancer. Clin Epigenetics, 2015. 7: p. 100.
15. Sun, K., et al., Plasma DNA tissue mapping by genome-wide methylation sequencing for noninvasive prenatal, cancer, and transplantation assessments. Proc Natl Acad Sci USA, 2015. 112(40): p. E5503-12.
16. Chan, K. C., et al., Noninvasive detection of cancer-associated genome-wide hypomethylation and copy number aberrations by plasma DNA bisulfite sequencing. Proc Natl Acad Sci USA, 2013. 110(47): p. 18761-8.
17. Sharma, S., T. K. Kelly, and P. A. Jones, Epigenetics in cancer. Carcinogenesis, 2010. 31(1): p. 27-36.
18. Sturm, D., et al., Hotspot mutations in H3F3A and IDH1 define distinct epigenetic and biological subgroups of glioblastoma. Cancer Cell, 2012. 22(4): p. 425-37.
19. Hinoue, T., et al., Genome-scale analysis of aberrant DNA methylation in colorectal cancer. Genome Res, 2012. 22(2): p. 271-82.
20. Stirzaker, C., et al., Methylome sequencing in triple-negative breast cancer reveals distinct methylation clusters with prognostic value. Nat Commun, 2015. 6: p. 5899.
21. Fang, F., et al., Breast cancer methylomes establish an epigenomic foundation for metastasis. Sci Transl Med, 2011. 3(75): p. 75ra25.
22. van der Maaten, L. and G. Hinton, Visualizing Data using t-SNE. Journal of Machine Learning Research, 2008. 9: p. 2579-2605.
23. Kandoth, C., et al., Mutational landscape and significance across 12 major cancer types. Nature, 2013. 502(7471): p. 333-9.
24. McGranahan, N., et al., Clonal status of actionable driver events and the timing of mutational processes in cancer evolution. Sci Transl Med, 2015. 7(283): p. 283ra54.
25. Zauber, P., S. Marotta, and M. Sabbath-Solitare, KRAS gene mutations are more common in colorectal villous adenomas and in situ carcinomas than in carcinomas. Int J Mol Epidemiol Genet, 2013. 4(1): p. 1-10.
26. Martincorena, I., et al., Tumor evolution. High burden and pervasive positive selection of somatic mutations in normal human skin. Science, 2015. 348(6237): p. 880-6.
27. Beltran, H., et al., Divergent clonal evolution of castration-resistant neuroendocrine prostate cancer. Nat Med, 2016. 22(3): p. 298-305.

1. A method, comprising: (a) subjecting a plurality of nucleic acid molecules generated from a cell-free deoxyribonucleic acid (cfDNA) sample of said subject to sequencing to yield a plurality of sequencing reads; (b) computer processing said plurality of sequencing reads to generate a methylation profile for said plurality of nucleic acid molecules; and (c) computer processing said methylation profile to determine that said subject has or is at risk of having said cancer at an area under the receiver operating characteristic curve (AUROC) of at least about 94%.

2. The method of claim 1, wherein said cancer is selected from the group consisting of lung cancer, breast cancer, colorectal cancer, acute myelogenous leukemia, and glioblastoma multiforme.

3. The method of claim 2, wherein said cancer is acute myelogenous leukemia.

4. The method of claim 3, wherein said AUROC is at least about 99%.

5. The method of claim 1, wherein said determining said subject has or is at risk of having a type of cancer comprises determining a tissue of origin of said cfDNA.

6. The method of claim 1, further comprising determining said subject has or is at risk of having a subtype of cancer.

7. The method of claim 2, when said subject has or is at risk of breast cancer, further comprising determining a subtype of breast cancer, wherein said subtype comprises ER positive, ER negative, HER2 positive, HER2 negative, or triple-negative breast cancer (TNBC).

8. The method of claim 2, when said subject has or is at risk of acute myelogenous leukemia, further comprising determining a subtype of acute myelogenous leukemia, wherein said subtype comprises FLT3 negative or FLT3 positive.

9. The method of claim 2, when said subject has or is at risk of glioblastoma multiforme, further comprising determining a subtype of glioblastoma multiforme, wherein said subtype comprises IDH mutation positive or IDH mutation negative.

10. The method of claim 2, when said subject has or is at risk of lung cancer, further comprising determining a subtype of lung cancer, wherein said subtype comprises adenocarcinoma, squamous carcinoma, or small cell carcinoma.
11. The method of claim 1, further comprising generating a report that said subject does or does not have said cancer or is or is not at risk of having said cancer.

12. A method, comprising: (a) subjecting a plurality of nucleic acid molecules generated from a cell-free deoxyribonucleic acid (cfDNA) sample of said subject to sequencing to yield a plurality of sequencing reads; (b) computer processing said plurality of sequencing reads to generate a methylation profile for said plurality of nucleic acid molecules; and (c) computer processing said methylation profile to determine that said subject has or is at risk of having said specific stage of said cancer at an area under the receiver operating characteristic curve (AUROC) of at least about 93%.

13. The method of claim 12, wherein said cancer is lung cancer.

14. The method of claim 13, wherein said specific stage is an early stage of lung cancer.

15. The method of claim 14, wherein said AUROC is at least about 95%.

16. The method of claim 13, wherein said specific stage is a late stage of lung cancer.

17. The method of claim 12, wherein said methylation profile comprises methylation levels of a plurality of differentially methylated regions (DMRs) of said plurality of nucleic acid molecules.

18. The method of claim 17, wherein said DMR comprises hypermethylation or hypomethylation.

19. The method of claim 12, further comprising mixing said cfDNA sample with an amount of filler DNA to generate a DNA mixture sample.

20. The method of claim 19, wherein said DNA mixture sample comprises at least an amount of total DNA that is at least about 50 nanograms (ng).

21. The method of claim 20, wherein said filler DNA is at least partially methylated and comprises a length of about 50 bp to 800 bp.

22. The method of claim 12, further comprising incubating said DNA mixture to increase a rate of enrichment of at least one or more methylated regions of said plurality of nucleic acid molecules of said cfDNA sample.

23. The method of claim 22, further comprising incubating said DNA mixture with a binder that is configured to bind methylated nucleotides, wherein said binder comprises a protein comprising a methyl-CpG-binding domain.

24. The method of claim 23, further comprising incubating said DNA mixture with a binder that is configured to bind methylated nucleotides, wherein said binder comprises an antibody.

25. The method of claim 12, wherein computer processing said methylation profile comprises comparing to a methylation profile of a healthy subject or using a trained machine learning algorithm.

26. The method of claim 25, wherein said trained machine learning algorithm comprises a linear regression.

27. The method of claim 26, wherein said comparing comprises comparing said methylation profile to said methylation profile of said healthy subject with respect to FANTOM5 enhancers, CpG islands, CpG shores, CpG shelves, or any combination thereof.
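
The AUROC thresholds recited in claims 1, 4, 12 and 15 refer to the area under the receiver operating characteristic curve of a classifier. By way of illustration only, the following Python sketch computes an AUROC from classifier scores using the rank-based (Mann-Whitney) formulation, in which the AUROC equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative sample. The function name `auroc`, the scores and the labels are hypothetical; they are not data from this disclosure and do not represent the classifier described herein.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores: label 1 = cancer, label 0 = healthy control
scores = [0.95, 0.80, 0.70, 0.40, 0.35, 0.10]
labels = [1, 1, 0, 1, 0, 0]
print(f"AUROC = {auroc(scores, labels):.3f}")
```

On this scale, a returned value of 0.94 corresponds to the "at least about 94%" threshold of claim 1. The pairwise loop is quadratic in the number of samples and is intended for clarity rather than efficiency.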